Information Integration of Partially Labeled Data

Authors

  • Steffen Rendle
  • Lars Schmidt-Thieme
Abstract

A central task when integrating data from different sources is to detect identical items. For example, price comparison websites have to identify offers for identical products. This task is known, among others, as record linkage, object identification, or duplicate detection. In this work, we examine problem settings where some relations between items are given in advance – for example by EAN article codes in an e-commerce scenario or by manually labeled parts. To represent and solve these problems we bring in ideas of semi-supervised and constrained clustering in terms of pairwise must-link and cannot-link constraints. We show that extending object identification by pairwise constraints results in an expressive framework that subsumes many variants of the integration problem like traditional object identification, matching, iterative problems or an active learning setting. For solving these integration tasks, we propose an extension to current object identification models that assures consistent solutions to problems with constraints. Our evaluation shows that additionally taking the labeled data into account dramatically increases the quality of state-of-the-art object identification systems.
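The idea of pairwise must-link and cannot-link constraints can be illustrated with a small sketch. This is not the authors' model: it is a minimal greedy, single-linkage-style grouping over a union-find structure, written under the assumption that a pairwise similarity function and a merge threshold are available. The names `constrained_clusters`, `similarity`, and `threshold`, as well as the toy offers and scores, are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def constrained_clusters(items, similarity, must_link, cannot_link, threshold=0.5):
    """Group items greedily while honoring must-link / cannot-link pairs.

    Hypothetical helper for illustration; not the method from the paper.
    """
    parent = {x: x for x in items}

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # 1. Must-link pairs (e.g. offers sharing an EAN code) are merged first.
    for a, b in must_link:
        parent[find(a)] = find(b)

    # 2. The given constraints themselves must be consistent.
    for u, v in cannot_link:
        if find(u) == find(v):
            raise ValueError(f"inconsistent constraints for pair ({u}, {v})")

    def would_violate(ra, rb):
        # Merging clusters ra and rb is forbidden if a cannot-link pair spans them.
        return any({find(u), find(v)} == {ra, rb} for u, v in cannot_link)

    # 3. Remaining pairs are merged in order of decreasing similarity.
    scored = sorted(
        ((similarity(a, b), a, b) for a, b in combinations(items, 2)),
        key=lambda t: t[0],
        reverse=True,
    )
    for score, a, b in scored:
        if score < threshold:
            break
        ra, rb = find(a), find(b)
        if ra != rb and not would_violate(ra, rb):
            parent[ra] = rb

    groups = {}
    for x in items:
        groups.setdefault(find(x), []).append(x)
    return sorted(sorted(g) for g in groups.values())


# Toy data (assumed): four offers, one known match and one known non-match.
sims = {frozenset(("a", "c")): 0.9, frozenset(("c", "d")): 0.8}
result = constrained_clusters(
    items=["a", "b", "c", "d"],
    similarity=lambda x, y: sims.get(frozenset((x, y)), 0.1),
    must_link=[("a", "b")],
    cannot_link=[("b", "c")],
)
print(result)
```

In this toy run, "a" and "c" look similar, but merging them would place "b" and "c" in the same group, which the cannot-link constraint forbids, so the merge is skipped and the solution stays consistent with the labeled pairs.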


Similar resources

Adaptive Information Analysis in Higher Education Institutes

Information integration plays an important role in academic environments since it provides a comprehensive view of education data and enables managers to analyze and evaluate the effectiveness of education processes. However, the problem with traditional information integration is the lack of personalization due to weak information resources or the unavailability of analysis functionality. In this ...



PathLog: a Query Language for Schemaless Databases of Partially Labeled Objects

In the paper we deal with the problem of modeling and querying information in schemaless databases of partially labeled objects (PLO-DB). Partially labeled objects are used for modeling data within repositories integrating both structured and semistructured data. The proposed PLO (Partially Labeled Objects) data model originates from the OEM data model and extends it by allowing partial labelin...


Integration and Reduction of Microarray Gene Expressions Using an Information Theory Approach

The DNA microarray is an important technique that allows researchers to analyze large amounts of gene expression data in parallel. Although the data can be more significant if they come from separate experiments, one of the most challenging phases in the microarray context is the integration of separate expression level datasets that have been gathered through different techniques. In this paper, we prese...


Critical Success Factors for Data Virtualization: A Literature Review

Data Virtualization (DV) has become an important method to store and handle data cost-efficiently. However, it is unclear which kinds of data should be virtualized, and when. We applied a design science approach in the first stage to establish the state of the art of DV regarding data integration and to present a concept matrix. We extend the knowledge base with a systematic literature review ...




Publication date: 2007